https://github.com/rjmaitri/04_Bartolini_Bob_2020.git
Functions and Tidy Data Biol 607 10/4/2020
# One Bootsrap Sample ####
bootstrap <- function(vec){
one_boot<- sample(vec,
size = length(vec),
replace = TRUE)
return(one_boot)
}
bootstrap(c(1,4,5))
## [1] 4 1 4
make sure you are using the function(s) you wrote in #1
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#input vec, R(#of bootstraps), mean
#default to 1k= R
bootstrap <- function(vec){
one_boot<- sample(vec,
size = length(vec),
replace = TRUE)
return(one_boot)
}
boot_mean <- function(vec, R = 1000, fun = mean) {
out <- replicate(R, bootstrap(vec))
fun(out)
}
boot_mean(c(3,4,5,3,2,50),10)
## [1] 11.2
#compare to actual mean of vector
mean(c(3,4,5,3,2))
## [1] 3.4
data.frame(R = 1:10) %>%
rowwise(R) %>%
summarize(boot_mean(c(3,4,5,3,2,50),10, fun = mean))
## `summarise()` regrouping output by 'R' (override with `.groups` argument)
## # A tibble: 10 x 2
## # Groups: R [10]
## R `boot_mean(c(3, 4, 5, 3, 2, 50), 10, fun = mean)`
## <int> <dbl>
## 1 1 11.9
## 2 2 11.9
## 3 3 11.2
## 4 4 10.9
## 5 5 12.5
## 6 6 8.78
## 7 7 12.1
## 8 8 12.7
## 9 9 9
## 10 10 8.98
##with functions
bootstrap <- function(vec){
one_boot<- sample(vec,
size = length(vec),
replace = TRUE)
return(one_boot)
}
boot_mean <- function(vec, R = 1000, fun = mean) {
out <- replicate(R, bootstrap(vec))
fun(out)
}
boot_mean(c(3,4,5,3,2,50),10)
## [1] 10.4
#compare to actual mean of vector
mean(c(3,4,5,3,2))
## [1] 3.4
stats_Bootsfunc <- function(vec, R = 1000, fun = mean){
vals <- replicate(R, bootstrap(vec))
bstraps_mean <- mean(vals)
mean_vec <- mean(vec)
firstquant <- quantile(vals,0.025)
thirdquant <- quantile(vals,0.975)
bias <- mean(vec) - mean(vals)
out <- data.frame(mean_vec = mean_vec,
mean_samp = bstraps_mean,
firstquantile = firstquant,
thirdquantile = thirdquant,
bias = bias)
return(out)
}
stats_Bootsfunc(c(4,5,2,2,3,2,4,16,8,9,9,19,8,32,32,32,32,41,4,8,4,5),1)
## mean_vec mean_samp firstquantile thirdquantile bias
## 2.5% 12.77273 10.27273 2 36.275 2.5
reactable(data.frame(R = 1:100) %>%
rowwise(R) %>%
summarize(stats_Bootsfunc(c(3,4,5,5,5,6,5,10,4,5,4,4,16,8,9,9,19,18,20,21,22,4,8,4,5,5,6,5,6,4),R=100, fun = mean)))
## `summarise()` regrouping output by 'R' (override with `.groups` argument)
FiveThirtyEight keeps a great archive of poll data at https://projects.fivethirtyeight.com/polls/. The presidential general election polling data is freely available at https://projects.fivethirtyeight.com/polls-page/president_polls.csv with question, poll id, and cycle defining a unique poll.
4a. Download and look at the data. Is it long or wide?
library(readr)
pres_poll <- read_csv("president_polls.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## question_id = col_double(),
## poll_id = col_double(),
## cycle = col_double(),
## pollster_id = col_double(),
## sponsor_ids = col_number(),
## pollster_rating_id = col_double(),
## sample_size = col_double(),
## seat_number = col_double(),
## seat_name = col_logical(),
## internal = col_logical(),
## tracking = col_logical(),
## nationwide_batch = col_logical(),
## ranked_choice_reallocated = col_logical(),
## race_id = col_double(),
## candidate_id = col_double(),
## pct = col_double()
## )
## See spec(...) for full column specifications.
reactable(pres_poll)
The presidential polls dataset is wide, as it has two rows dedicated to each polling question.
4b. Get just the polling data for this last week (from 9/29 to today). Filter on start_date. Also filter down to just Biden and Trump (see candidate_name or answer). Extra credit for using {lubridate} for this, but you can just do a messy %in% string match.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(dplyr)
library(readr)
class(pres_poll$start_date)
## [1] "character"
pres_poll$start_date<- as.Date(pres_poll$start_date, format = "%m/%d/%Y")
presPoll_filter <- filter(pres_poll, answer == "Biden" | answer == "Trump")
pres_current <- presPoll_filter[presPoll_filter$start_date >= "2020-09-29" & presPoll_filter$start_date <= "2020-10-10",]
reactable(pres_current)
4c. OK, this is your sample. What’s the bootstrapped average percentage for each candidate for nationwide polls (state == "")? Note, this answer will not match 538 given their weighting by poll trustworthiness.
#filter by president and nationwide
Boot_pct <- pres_current %>%
select(state, answer, pct)
#replace NA's with zeros to keep dpylr happy
vec_3 <- replace(Boot_pct$state, is.na(Boot_pct$state), 0)
data <- data.frame(Nationwide = c(vec_3),
Candidate = c(Boot_pct$answer),
pct = Boot_pct$pct)
#Trump & National Pct filter for bootstrap
trumpbootdata <- data %>%
filter(Candidate == "Trump" & Nationwide == 0)
# SE of the mean ####
# SE of the mean ####
Donald_boot_mean <- sample(trumpbootdata$pct,
size = length(trumpbootdata$pct),
replace = TRUE) %>% mean
#Trump & National Pct filter for bootstrap
Bidenbootdata <- data %>%
filter(Candidate == "Biden" & Nationwide == 0)
# SE of the mean ####
# SE of the mean ####
Biden_boot_mean <- sample(Bidenbootdata$pct,
size = length(Bidenbootdata$pct),
replace = TRUE) %>% mean
4d. What is the average difference between the two candidates by state and national polls? Note, you’ll need to make this a wide data frame to answer! And, well, try the pivot without this advice first, but then….
library(tidyr)
reactable(pivot_wider(pres_current))
BUT - what’s interesting about replicate() is that, if you ask it to turn back raw draws from a random number generator - or anything with more than one value - it gives you a matrix or array.
5a. So, I want you to, using the mean and SD of Biden’s national polling average (you’ll need to calculate it!) from above, simulate 1000 draws from that population with a sample size of 50. What are the dimensions of the object. What are in the rows and columns?
5b. Yuck. Can you turn this into something usable? Say, first make it a tibble or data frame, and then pivot it to long, such that you end up with a column that has an identifier for sim and a column with a single value from that sim?
(Oh, and for all columns, cols = everything())
5c. For each sim, what’s the bootsrapped mean and CI? Plot it! And tell us how often it’s greater than the initial mean. E.C. for the plot showing the stats in order from low to high.
5d. So…. what is that plot showing? What are the concepts involved?
EC 3 bonus point for each awesome quality visualization of the general polling data. There is a LOT there, so look carefully before you leap.